Hardware implementation of memristor-based artificial neural networks

Artificial Intelligence (AI) is currently experiencing a bloom driven by deep learning (DL) techniques, which rely on networks of connected simple computing units operating in parallel. The low communication bandwidth between memory and processing units in conventional von Neumann machines does not support the requirements of emerging applications that rely extensively on large sets of data. More recent computing paradigms, such as high parallelization and near-memory computing, help alleviate the data communication bottleneck to some extent, but paradigm- shifting concepts are required. Memristors, a novel beyond-complementary metal-oxide-semiconductor (CMOS) technology, are a promising choice for memory devices due to their unique intrinsic device-level properties, enabling both storing and computing with a small, massively-parallel footprint at low power. Theoretically, this directly translates to a major boost in energy efficiency and computational throughput, but various practical challenges remain. In this work we review the latest efforts for achieving hardware-based memristive artificial neural networks (ANNs), describing with detail the working principia of each block and the different design alternatives with their own advantages and disadvantages, as well as the tools required for accurate estimation of performance metrics. Ultimately, we aim to provide a comprehensive protocol of the materials and methods involved in memristive neural networks to those aiming to start working in this field and the experts looking for a holistic approach.


Supplementary note 1: Advanced mapping methods, bit-slicing, and tiling
In most cases of mapping neural networks to memristor-based hardware, it is not enough to convert the weights to conductance values.Circuit and architecture designers often have to deal with a limited number of conductance states in memristive devices and non-linear conductance distribution, optimization of the crossbar sizes, mapping large weights matrices into smaller crossbars, and mapping different types of neural network layers with different distributions of weights.All these challenges have corresponding solutions related to crossbar mapping techniques.
One of the most common techniques to map high-precision ANN weights into lowprecision memristive devices is called bit-slicing or weight-slicing [1], [2] Bit-slicing implies that n-bit neural network weights are represented by 2 !"# m-bit memristive devices (devices with 2 # conductance levels).This can be done due to the following reasons: (1) when the number of stable conductance levels in memristive devices is limited and lower than the required weight precision, (2) when low-precision ADC is used and cannot support highresolution crossbar weights, or (3) when the device-to-device or cycle-to-cycle conductance variation of memristors is too high and can affect the accuracy of VMM operation.A lower number of bits per device can decrease the effect of conductance variation on the VMM output.However, it also reduces the storage density of the crossbar, and, in turn, increases on-chip area due to the increased number of devices that should be used per single synapse [1].
Concerning the mapping of positive and negative weights, bit slicing has proven to be very useful when combined with 2's complement codification of the binarized synaptic weights [3]- [8].In this approach, it is possible to avoid the differential codification using two highprecision (i.e. with many well defined levels) memristors, and to replace them by a single array of low-precision memristors per synaptic weight.A different approach for mapping has been suggested in [9], where the training process involves constraining twice the number of weights to be nonnegative and associating them with individual devices.This method results in a more natural connection between neural network parameters and conductances, and also permits weight regularization to be utilized as a way of reducing power consumption.
The other challenge in memristor-based neural network hardware designs is to create generalized designs fitting various sizes of weight matrices and preserving efficient hardware utilization.This often implies a fixed size of the crossbars throughout the design.In turn, large weight matrices cannot be mapped into a single crossbar array.Therefore, a technique called tiling is used to divide a weight matrix and map it to several crossbar tiles or sub-crossbars [1], [10].There are several reasons to apply tiling: (1) when mapping large weight matrices to the hardware, which cannot fit into a single crossbar array, (2) when reducing required ADC resolution for a large crossbar, (3) when optimizing the size of the crossbar to avoid IR drop due to interconnects and long crossbar wires.
Same to the bit-streaming technique, both bit-slicing and tiling produce additional hardware overhead required to calculate partial sums [1].In the case of tiling, partial sums are produced by each tile or sub-crossbar, and then accumulated to form a final VMM value using adder circuits.In the case of bit-slicing, several memristive devices representing a single weight are arranged to neighbouring columns of a single crossbar or multiple crossbar tiles.When all devices representing a single weight are in the same crossbar, multiplexers and sample-andhold circuits are used to process and store the partial sums before calculating the final VMM value.When the components of a single weight are stored in different crossbar tiles, the partial sums from these tiles can be read in parallel but still require an adder circuit.
Non-idealities of memristive devices can also affect weight mapping.For example, the non-linear conductance distribution of the device should be considered, when quantizing and converting neural network weights to the conductance levels.If the example shown above assumes that it is possible to achieve linear and uniform conductance distribution, most of the memristive devices experience non-linearly distributed conductance states.In this case, quantized weights should be mapped to non-uniformly distributed discrete states before mapping to the conductance values.Moreover, advanced mapping methods can be used for alleviating the effect of memristor non-idealities [1].For example, ANN weights more sensitive to variations and defects can be mapped to defect-free or low-variation memristive cells in a crossbar [11], or closer to the voltage source to be affected less by IR drop occurring in the crossbar columns further from the voltage source due to wire resistances [1].
In addition, for more complex neural networks than ANN, the distribution of the weights in different layers may vary.For example, weight distribution in convolution layers in CNNs is different from fully-connected layers.Therefore, such layers require additional design considerations, including unrolling convolution kernels, and rearranging or duplicating the weights to achieve optimized hardware design [12], [13].

Supplementary Note 2: Memristor variability and thermal heating problem
Nanoscale size and non-volatility make memristor devices advantageous in implementation of fast, area and energy-efficient neural network accelerators.Typically, memristor are arranged into crossbar arrays and are stacked into 2.5D and 3D heterogeneous architectures [14].However, downscaling of the feature size F of memristor and their utilization in dense architectures also make them susceptible to thermal heating problem.In particular, miniaturization of the NiO device from 100 nm to 30 nm led to increase of temperature from 400K to 1800K [15].Moreover, increase of temperature leads to decrease of the expected shortest lifetime (ESL) [14] of a memristor and shrinking it's a  $! / $%% ratio leading to low reliability and data loss [16].In addition, a hot "aggressor" cell negatively affects the performance of surrounding "victim" cells resulting in thermal cross-talk problem.Their sensitivity level depends on the resolution of the device, crossbar array pitch length, materials used in conducting filament (CF) and electrodes.Similarly, heterogeneous parts of the hardware also have different thermal densities and cause inter-die thermal coupling and hotspots as shown in Supplementary Figure 1 for ISAAC [17], [18].As a result, in 2.5D design lifetime of a memristor is close to ESL, whereas in 3D design it reduces below 2.6 years [14].In other words, increase of temperature zoom in malfunctions associated with device-to-device and cycle-to-cycle variability.

Supplementary Figure 1. Power density in ISAAC.
It is important to note, that due to its cumulative nature, thermal effects have more impact on memristor-based ANN accelerators rather than memristor-based storage devices.In addition, different workloads result in different thermal distribution.For instance, the temperature difference between VGG16, InsecptionV3 and ResNet50 neural networks mapped to the same crossbar arrays can reach up to 17 ºK, as illustrated in Supplementary Figures 2a,  2b and 2c, respectively [19].Another work demonstrated that thermal problems can be mitigated via memristors static allocation considering the thermal distribution.Comparison of "naïve", "strike" and "chess" allocation schemes on memristor crossbar arrays for the same workload allowed to reduce the temperature difference between neighbouring cells down to 8 ºK as seen in Supplementary Figures 3a, 3b and 3c, respectively [20].This became the basis of numerous studies addressing the thermal challenges in memristor-based accelerators.The proposed solutions presented different re-mapping and optimization schemes at weight [19], [21] row/column [22], [23], subarray and arrays levels [19], [24].All of these re-mapping solutions rely on having a thermal heat map of crossbar array, the neural network model and the hardware characteristics.To construct such a heat map, synaptic weights of a neural network are first mapped to the memristor crossbar arrays using the "naïve" approach.Then, the temperature of each memristor is sensed by individually assigned sensors (Ideal Thermal Sensors -TPS-Ideal-approach) and ranges between 273 ºK and 400 ºK.According to the empirical results, the critical temperature above which memristor performance starts to degrade is ~340 ºK.Therefore, a threshold temperature typically is set to 330 ºK.To reduce redundancy of temperature sensors, a limited TPS approach can be adopted.Here, temperature is collected at the hottest points of the memristor crossbar and temperature of other rows is estimated using the equation below:

Supplementary
where and  '×()+,) are consecutively obtained temperatures of the two rows;  ×  ≤  ≤  × ( + 1).In Hybrid Memory Cube (HMC) design, temperature sensors are placed at the centre and corners of the memristor crossbars [23].Knowing the heat distribution in the memristor crossbars, the following remapping approaches can be applied to mitigate its effects.In temperature-aware weight adjustment (TAWA) scheme [22], when the temperature measured or estimated for a memristor cell is above 330K (temperature threshold), the cell is considered as "hot" and the weight mapped to it is pruned.

B) Temperature-aware row adjustment in a crossbar arrays
Here, effective rows of a neural network are mapped to cold rows of a crossbar array, whereas ineffective rows are mapped to hot rows as in Supplementary Figure 4. Neural network matrix rows are classified to effective and ineffective rows using Summed Weight Variations (SWV) metric and predefined threshold .Memristor crossbar array rows are categorized as cold if temperature is below threshold 330 ºK, otherwise rows are hot.The Summed Weight Variations (SWV) metric can be estimated based on equation below: where q is the row of  ×  memristor crossbar array;  .& is the weight at location (p,j) and  /& is the actual weight at location (q,j).SWV value changes with temperature.If  ./> , rows are considered as effective, otherwise as ineffective.Different applications require different levels of  [23].

C) Weight decomposition
The conventional way to map both positive and negative weights of a neural network on memristor crossbar array is to split it to two crossbar arrays in a way that the absolute value of a positive weight is stored in positive array  .$2and the absolute value of a negative weights is stored in a negative array  !34 .Then, analog subtraction of the values is done.There are 2 5 −  ways to decompose N bit value V into  .$2and  !34 .Temperature of a memristor crossbar array is proportional to the applied voltage.Besides, the sum of partial weights stored in each cell is proportional to the applied voltage too.Therefore, the thermal-aware decomposition strategy aims to find the case with the smallest sum of partial weights: ( − 3) S 7,$.9 .$2,  7,$.9 !34 T = arg min Here, k is the decomposition case for weight value V;  6 .$2(, ) and  6 !34 (, ) i th positive and negative partial weights of the k th decomposition case for weight V.
To reduce temperature in crossbar array, the weight decomposition technique called TOPAR I generates all possible decomposition cases with corresponding cost (, ).For example, if a weight value V=15 should be represented as an 8-bit value using four 2-bit memristor cells, there are 241 decomposition cases such as (15,0), (16,1), (17,2), and so on.Based on Equation 3, it can be found that the decomposition case (16,1) that can be represented using eight 2-bit cells 00,01,00,00,00,00,00,01 has the minimum cost (16, 1) = 2 which is also illustrated in Supplementary Figure 5.

Supplementary Figure 5. Weight decomposition with the lowest temperature D) Thermal-aware column re-ordering
Temperature-aware column re-ordering is typically used after weight decomposition.It aims to reduce the temperature difference between arrays by shuffling the order of pairs.Since a single weight is represented by a pair of two weights located at  .$2and  !34 , column reordering should take this into consideration and therefore difference between positives and negative arrays of  9: re-ordering pair: where  ; the number of crossbar elements;  6 .$2(, ) and  6 !34 (, ) are  9: partial weight in positive and negative array of each pair.Column re-ordering is an optimization problem that aims to minimize the sum of the partial weights in each array  <== as shown in Supplementary Figure 6.

F) Bit-width downgrading
The bit-width downgrading scheme [19] is applied if the sensed temperature of a memristor exceeds the threshold temperature of 330 ºK.A typical weight to conductance encoding scheme is explained in detail in section 2.3 and equation 4 from the main text.Yet, for the sake of clarity we can re-write it as: where  = > "#$ +> "%& ?"#$ "? "%& and  =  #<@ −  *  #<@ .To downgrade the bit width of a hot cell a new conductance value is calculated: where N represents the number of bits shifted right.Then a memristor states should be encoded with a new conductance value.In [] when downgrade bit signal is obtained, there are two cases.If it makes change from 0 to 1, the system recomputed the conductance with N bit shift and re-programmes the cells.If change is made from 1 to 0, the system performs weight restoration and programmes original weights to memristors.If bit downgrading took place, output of crossbar array is shifted N bits back using shift-and-add unit.

Supplementary Figure 7. Fine-grained weight adjustment G) Tile-pairing
The idea of tile pairing is pairing overheated "master" tile with cooled-down "slave" pair to reduce the power consumption in the hot tile [19].The unpaired tile is considered overheated in case that 80% of the crossbar's memristors reach the threshold temperature.Crossbar arrays in both master and slave tiles use the same weights  6& and same input  6 .The output of master crossbar array is  $A9 # = d 1 ,  B,..,  B5 f and the slave crossbar is  $A9 2 = d , ,  D,..,  B5+, f , respectively.The output voltage can be found as:

H) Temperature-aware weights remapping in subarrays
According to [21], shallow layers (L2-L4) of a VGG-11 neural network model have more impact on the final result than deep layers (L8-L10).To preserve the accuracy of neural model, the weights of shallow layers should be mapped on sub-arrays with lower temperature.To implement weights re-mapping on crossbar array a special placement strategy is used [21].Therefore, weights of each layer are divided into smaller weights subsets to fit the size of a subarray and the required number of subarray is calculated.Then, temperature of each subarray is summed and stored.The aim of placement function is to place adjacent weights subsets as close as possible to reduce interconnection delays with consideration of a generated thermal heat map of crossbar array.When the function is applied, a "weight map" with the lowest temperature distribution is generated.

I) Weight pruning and splitting (WPS) in subarrays
Weight pruning and splitting technique is used to reduce temperature of hot subarrays in the cases when the accuracy is below a threshold ℎ G;; (determined by a designer) and weight remapping did not help [21].Decision on which weights should be pruned is made based on criticality parameter.Since neural network layers have different impact on the final output, pruning ratio Prune_ratio of first layers should be smaller than the ratio of the deep layers.Besides, weights of each layer i have their own level of importance.The criticality of each subset is evaluated using absolute sum of the weights and subsets are sorted in a list based on their criticality level from high to low.First half of the list is classified as "critical" and the other half as "less-critical".Pruning top critical weights has more negative impact on the accuracy than pruning less-critical weights.After each pruning iteration  is subtracted from a initial Prune_ratio=80% and repeated until the obtained accuracy is higher than ℎ G;; .Pruning generates unused subarrays and decreases power consumption, but also decreases the accuracy.Therefore, the pruned model is retrained and a new weight map is generated.Splitting of critical weights to lower conductance values is applied in the case of the sufficient number of unused subarrays.For instance, an 8-bit weight is mapped to two 4-bit memristor cells.For the decimal value V=236 , cells should store 14 (1110) and 12 (1100).However, due to high temperature both cells store 8 (1000) that corresponds to a wrong V=136.Therefore, splitting conductance level will allow to restore the value V=236 using four 4-bit cells and lower half of the conductance range corresponding to 8, 6, 8 and 4 as shown in Supplementary Figure 8.

J) Weight compensation (WC) in subarrays
It is known that temperature affects weights with higher conductance.One of the ways to reduce weight value is shifting it right and restoring the output by shifting it left as was suggested in bitwidth downgrading in [19].For example, the weight value V=7 and input voltage is 1.If three MSB levels are affected by hight temperature, they can be shifted right, and the weight value becomes V=4.The multiplication output is equal to 4 and shifted left giving compensated value V=8 which is close to 7. The use of such weight compensation techniques is suggested only if weight remapping and WPS methods did not help [21].Supplementary Algorithm 2. Code of the k-fold cross-validation for MATLAB.Note that the part of the code that performs the training in each iteration is identical to that one used for training the SLP, except that in this case, the parameters trainFcn, trainRatio and testRatio are passed as arguments to the training code shown in Algorithm 3  Supplementary Algorithm 5. SPICE netlist of a 4×4 RRAM-based single layer perceptron implemented using 2 partitions for both the positive and negative crossbars of memristor arrays.The netlist include the CMOS blocks required for the inference as well as the and write phase..globalgnd! vdd! vss! 7.

39.
*First layer interconnect lines with series resistance 40.

for i=1: 2 % 9 .Supplementary Figure 10 .
The process is repeated for the positive and negative CPA for G_idx2=1:10 % This iterates over the rows of the 8*8x10 CPA for G_idx1=1:8*8 % This iterates over the columns of the 8*8x10 CPA % This defines an equation of the form I_ij=I(V,lambda_ij), where lambda_ij % is the variable to find.V is the read voltage.I_ij is calculated as % g_ij*V eqn= I_matrix(G_idx1,G_idx2,i)==(Imax*lambda+Imin*(1lambda))*(exp(beta*(alphamax*lambda+alphamin*(1-lambda))*(V-((rsmax*lambda+rsmin*(1-lambda)))*I_matrix(G_idx1,G_idx2,i)))-exp(-(1beta)*(alphamax*lambda+alphamin*(1-lambda))*(V-((rsmax*lambda+rsmin*(1lambda)))*I_matrix(G_idx1,G_idx2,i)))); % This solves equation eqn for variable lambda using the function vpasolve, % and stores the lambda value in the matrix W_init, in the position %(G_idx1,G_idx2,i); W_init(G_idx1,G_idx2,i)=double(vpasolve(eqn,lambda)); (a) Database creation and ideal software ANN generation and training.The last phase shows the computation of the memristors conductance values.(b) Partitioning of the matrix of synaptic weights (or control parameters λ).The possible correction of these values is performed at this point (if required) and the resulting values of the control parameter are then passed to the routine than creates the SPICE netlist for each partition, considering the line resistance, RRAM model, and connection scheme.(c) Partitions are combined to render the full circuit.Reconfigurable logic is added to switch between inference and write configurations.(d) Weight programming and inference simulations.Metrics extraction.Simplified sketch for a partitioned crossbar-array based single layer perceptron.The crossbar array is subdivided into N identically sized partitions to minimize the parasitic voltage drops and placed in the Back-End-Of-Line (BEOL).The circuit peripherals could potentially be placed below them, in the Front-End-Of-Line (FEOL).Partial output current vectors are indicated in the output of each partition.

%% This part performs the k-fold analysis.
Creation and training of a SLP for the classification of the MNIST images rescaled to 8×8 pixels (SLP of size 64×10).The network is created as a MATLAB net object, which includes properties such as the network number of inputs, biases, synaptic weights, neurons activation function, database distribution among test, train and validation samples.MATLAB code that solves the transport equation to determine the value of the control parameter λ for each of the memdiode devices.